Exploring Red Wine Quality by Allan Reyes

About

In this exercise, I will explore a data set on wine quality and physicochemical properties. The objective is to explore which chemical properties influence the quality of red wines. I’ll start by exploring the data using the statistical program, R. As interesting relationships in the data are discovered, I’ll produce and refine plots to illustrate them. The data is available for download here and background information is available at this link.

Summary Statistics

Let’s run some basic functions to examine the structure and schema of the data set.

str(df)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
summary(df)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Since we’re primarily interested in quality, it would also be interesting to see basic statistics on that as well.

summary(df$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Some initial observations here:

  • There are 1599 observations of 13 numeric variables.
  • X appears to be the unique identifier.
  • quality is an ordered, categorical, discrete variable. From the literature, this was on a 0-10 scale, and was rated by at least 3 wine experts. The values ranged only from 3 to 8, with a mean of 5.6 and median of 6.
  • All other variables seem to be continuous quantities (w/ the exception of the .sulfur.dioxide suffixes).
  • From the variable descriptions, it appears that fixed.acidity ~ volatile.acidity and free.sulfur.dioxide ~ total.sulfur.dioxide may possible by dependent, subsets of each other.

Univariate Plots

To first explore this data visually, I’ll draw up quick histograms of all 12 variables. The intention here is to see a quick distribution of the values.

Univariate Analysis

Wine Quality

I first looked at wine quality. Although it has a discrete range of only 3-8, we can roughly see that there is some amount of normal distribution. A large majority of the wines examined received ratings of 5 or 6, and very few received 3, 4, or 8. There’s not much more we can do with this histogram, as both decreasing or increasing bin sizes would distort the data.

Given the ratings and distribution of wine quality, I’ll instantiate another categorical variable, classifying the wines as ‘bad’ (rating 0 to 4), ‘average’ (rating 5 or 6), and ‘good’ (rating 7 to 10).

##     bad average    good 
##      63    1319     217

Distributions and Outliers

  • It appears that density and pH are normally distributed, with few outliers.
  • Fixed and volatile acidity, sulfur dioxides, sulphates, and alcohol seem to be long-tailed.
  • Qualitatively, residual sugar and chlorides have extreme outliers.
  • Citric acid appeared to have a large number of zero values. I’m curious whether this is truly zero, or if it is a case of non-reporting.

When plotted on a base 10 logarithmic scale, fixed.acidity and volatile.acidity appear to be normally-distributed. This makes sense, considering that pH is normally distributed, and pH, by definition, is a measure of acidity and is on a logarithmic scale. Curiously, however, citric.acid, did not appear to be normally-distributed on a logarithmic scale. Upon further investigation:

length(subset(df, citric.acid == 0)$citric.acid)
## [1] 132

It is apparent that 132 observations had a value of zero. This yields some concerns on whether or not these 132 values were reported or not, considering that the next ‘bin’ higher contains only 32 observations.

## Warning: Removed 79 rows containing non-finite values (stat_bin).

## Warning: Removed 80 rows containing non-finite values (stat_bin).

## Warning: Removed 79 rows containing non-finite values (stat_bin).

Short questions

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

While exploring the univariate histogram distributions, there did not appear to be any bimodal or multimodal distributions that would warrant sub-classification into categorical variables. I considered potentially splitting residual.sugar into ‘sweet wine’ and ‘dry wine’, but the Wikipedia source cited a residual sugar of greater than 45 g/L or g/m^3 to classify as a sweet wine.

Did you create any new variables from existing variables in the dataset?

I instantiated an ordered factor, rating, classifying each wine sample as ‘bad’, ‘average’, or ‘good’.

Upon further examination of the data set documentation, it appears that fixed.acidity and volatile.acidity are different types of acids; tartaric acid and acetic acid. I decided to create a combined variable, TAC.acidity, containing the sum of tartaric, acetic, and citric acid.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I addressed the distributions in the ‘Distributions’ section. Boxplots are better suited in visualizing the outliers.

In univariate analysis, I chose not to tidy or adjust any data, short of plotting a select few on logarithmic scales. Bivariate boxplots, with X as rating or quality, will be more interesting in showing trends with wine quality.

Bivariate Plots and Analysis

To get a quick snapshot of how the variables affect quality, I generated box plots for each.

From exploring these plots, it seems that a ‘good’ wine generally has these trends:

  • higher fixed acidity (tartaric acid) and citric acid, lower volatile acidity (acetic acid)
  • lower pH (i.e. more acidic)
  • higher sulphates
  • higher alcohol
  • to a lesser extend, lower chlorides and lower density

Residual sugar and sulfur dioxides did not seem to have a dramatic impact on the quality or rating of the wines. Interestingly, it appears that different types of acid affect wine quality different; as such, TAC.acidity saw an attenuated trend, as the presence of volatile (acetic) acid accompanied decreased quality.

By utilizing cor.test, I calculated the correlation for each of these variables against quality:

##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12405165          -0.39055778           0.22637251 
##          TAC.acidity log10.residual.sugar      log10.chlordies 
##           0.10375373           0.02353331          -0.17613996 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##          -0.05065606          -0.18510029          -0.17491923 
##                   pH      log10.sulphates              alcohol 
##          -0.05773139           0.30864193           0.47616632

Quantitatively, it appears that the following variables have relatively higher correlations to wine quality:

  • alcohol
  • sulphates (log10)
  • volatile acidity
  • citric acid

Let’s see how these variables compare, plotted against each other and faceted by wine rating:

The relative value of these scatterplots are suspect; if anything, it illustrates how heavily alcohol content affects rating. The weakest bivariate relationship appeared to be alcohol vs. citric acid. The plots were nearly uniformly-distributed. The strongest relationship appeared to be volatile acididty vs. citric acid, which had a negative correlation.

Examining the acidity variables, I saw strong correlations between them:

## 
##  Pearson's product-moment correlation
## 
## data:  df$fixed.acidity and df$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034

## 
##  Pearson's product-moment correlation
## 
## data:  df$volatile.acidity and df$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

## 
##  Pearson's product-moment correlation
## 
## data:  log10(df$TAC.acidity) and df$pH
## t = -39.663, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7283140 -0.6788653
## sample estimates:
##        cor 
## -0.7044435

Most notably, base 10 logarithm TAC.acidity correlated very well with pH. This is certainly expected, as pH is essentially a measure of acidity. An interesting question to pose, using basic chemistry knowledge, is to ask what other components other than the measured acids are affecting pH. We can quantify this difference by building a predictive linear model, to predict pH based off of TAC.acidity and capture the % difference as a new variable.

The median % error hovered at or near zero for most wine qualities. Notably, wines rated with a quality of 3 had large negative error. We can interpret this finding by saying that for many of the ‘bad’ wines, total acidity from tartaric, acetic, and citric acids were a worse predictor of pH. Simply put, it is likely that there were other components–possibly impurities–that changed and affected the pH.

As annotated previously, I hypothesized that free.sulfur.dioxide and total.sulfur.dioxide were dependent on each other. Plotting this:

## `geom_smooth()` using method = 'gam'

## 
##  Pearson's product-moment correlation
## 
## data:  df$free.sulfur.dioxide and df$total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6395786 0.6939740
## sample estimates:
##       cor 
## 0.6676665

It is clear that there is a very strong relationship between the two. Aside from TAC.acidity, this seemed to be the strongest bivariate relationship. Additionally, despite the telling name descriptions, the clear ‘floor’ on this graph hints that free.sulfur.dioxide is a subset of total.sulfur.dioxide.

Multivariate Plots

Multivariate Analysis

I primarily examined the 4 features which showed high correlation with quality. These scatterplots were a bit crowded, so I faceted by rating to illustrate the population differences between good wines, average wines, and bad wines. It’s clear that a higher citric acid and lower volatile (acetic) acid contributes towards better wines. Likewise, better wines tended to have higher sulphates and alcohol content. Surprisingly, pH had very little visual impact on wine quality, and was shadowed by the larger impact of alcohol. Interestingly, this shows that what makes a good wine depends on the type of acids that are present.

Final Plots and Summary

Plot 1: Effect of acids on wine quality

These subplots were created to demonstrate the effect of acidity and pH on wine quality. Generally, higher acidity (or lower pH) is seen in highly-rated wines. To caveat this, a presence of volatile (acetic) acid negatively affected wine quality. Citric acidity had a high correlation with wine quality, while fixed (tartaric) acid had a smaller impact.

Plot 2: Effect of Alcohol on Wine Quality

These boxplots demonstrate the effect of alcohol content on wine quality. Generally, higher alcohol content correlated with higher wine quality. However, as the outliers and intervals show, alchol content alone did not produce a higher quality.

Plot 3: What makes good wines, good, and bad wines, bad?

This is perhaps the most telling graph. I subsetted the data to remove the ‘average’ wines, or any wine with a rating of 5 or 6. As the correlation tests show, wine quality was affected most strongly by alcohol and volaticle acidity. While the boundaries are not as clear cut or modal, it’s apparent that high volatile acidity–with few exceptions–kept wine quality down. A combination of high alcohol content and low volatile acidity produced better wines.

Reflection

Through this exploratory data analysis, I was able to identify the key factors that determine and drive wine quality, mainly: alcohol content, sulphates, and acidity. It is important to note, however, that wine quality is ultimately a subjective measure, albeit measured by wine experts. That said, the correlations for these variables are within reasonable bounds. The graphs adequately illustrate the factors that make good wines ‘good’ and bad wines ‘bad’. Further study with inferential statistics could be done to quantitatively confirm these assertions.